Python For Data Science Cheat Sheet 


Scikit-Learn 
Learn Python for data science Interactively at www.DataCamp.com 


Scikit-learn is an open source Python library that
implements a range of machine learning,
preprocessing, cross-validation and visualization
algorithms using a unified interface.


A Basic Example 


>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)


Loading The Data (also see NumPy & Pandas)


Your data needs to be numeric and stored as NumPy arrays or SciPy sparse 
matrices. Other types that are convertible to numeric arrays, such as Pandas 
DataFrame, are also acceptable. 
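A minimal sketch of the point above: the same estimator is fitted on a NumPy array, a Pandas DataFrame and a SciPy sparse matrix. The toy values, the column names `f1`/`f2` and the choice of a 1-nearest-neighbour classifier are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: four samples, two numeric features.
features = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.8]])
labels = np.array([0, 1, 1, 0])

# Three containers holding the same numbers.
as_frame = pd.DataFrame(features, columns=["f1", "f2"])
as_sparse = sparse.csr_matrix(features)

preds = []
for data in (features, as_frame, as_sparse):
    # With one neighbour, each training point is its own nearest neighbour.
    clf = KNeighborsClassifier(n_neighbors=1).fit(data, labels)
    preds.append(clf.predict(data).tolist())
```

All three fits should give the same predictions, since only the container differs.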


>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0


Training And Test Data 


>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
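A short sketch of what the split returns. The 20-sample array and the names `X_demo`/`y_demo` are made up for illustration; `test_size` and `stratify` are optional `train_test_split` parameters not used in the snippet above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # hypothetical: 20 samples, 2 features
y_demo = np.array([0, 1] * 10)          # hypothetical balanced labels

# With no test_size given, 25% of the samples are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# stratify keeps the class proportions the same in both parts of the split.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
```

Fixing `random_state` makes the shuffle reproducible between runs.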


Create Your Model 


Supervised Learning Estimators 


Linear Regression 
>>> from sklearn.linear_model import LinearRegression 
>>> lr = LinearRegression()    # the normalize parameter was removed in scikit-learn 1.2; scale features with StandardScaler instead

Support Vector Machines (SVM) 


>>> from sklearn.svm import SVC 
>>> svc = SVC(kernel='linear') 


Naive Bayes 


>>> from sklearn.naive_bayes import GaussianNB 
>>> gnb = GaussianNB()


KNN 


>>> from sklearn import neighbors 
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)


Unsupervised Learning Estimators 


Principal Component Analysis (PCA) 
>>> from sklearn.decomposition import PCA 
>>> pca = PCA(n_components=0.95) 


K Means 
>>> from sklearn.cluster import KMeans 
>>> k_means = KMeans(n_clusters=3, random_state=0) 


Model Fitting

Supervised Learning
>>> lr.fit(X, y)                              # Fit the model to the data
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)

Unsupervised Learning
>>> k_means.fit(X_train)                      # Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)    # Fit to data, then transform it
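A sketch of the difference between fitting and fitting-then-transforming, using PCA with a fractional `n_components` so that enough components are kept to explain at least 95% of the variance. The synthetic data and the per-feature scale factors are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: 100 samples, 5 features with very different variances.
X_demo = rng.normal(size=(100, 5)) * np.array([10.0, 5.0, 1.0, 0.1, 0.1])

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_demo)    # learn the components, then project

# Calling fit and then transform gives the same projection.
X_again = PCA(n_components=0.95).fit(X_demo).transform(X_demo)

kept = pca.n_components_                           # number of components retained
explained = pca.explained_variance_ratio_.sum()    # fraction of variance they explain
```

On new data, call only `transform` so the components learned from the training set are reused.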




Evaluate Your Model’s Performance 


Classification Metrics 


Accuracy Score
>>> knn.score(X_test, y_test)                       # Estimator score method
>>> from sklearn.metrics import accuracy_score      # Metric scoring functions
>>> accuracy_score(y_test, y_pred)

Classification Report
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))    # Precision, recall, f1-score and support

Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
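A small worked example of how accuracy relates to the confusion matrix. The label lists are hypothetical values chosen so the counts can be checked by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six samples, three classes.
y_true_demo = [0, 0, 1, 1, 1, 2]
y_pred_demo = [0, 1, 1, 1, 0, 2]

acc = accuracy_score(y_true_demo, y_pred_demo)    # 4 of 6 predictions are correct
cm = confusion_matrix(y_true_demo, y_pred_demo)   # rows: true class, columns: predicted class

# Correct predictions sit on the diagonal, so accuracy = trace / total.
acc_from_cm = cm.trace() / cm.sum()
```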


Regression Metrics 


Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)

Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)

R² Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
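A worked example of the three regression metrics on the same targets. The predictions in `y_pred_demo` are hypothetical values picked so the arithmetic is easy to follow.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_demo = [3, -0.5, 2]
y_pred_demo = [2.5, 0.0, 2]    # hypothetical predictions; errors are 0.5, -0.5 and 0

mae = mean_absolute_error(y_true_demo, y_pred_demo)   # (0.5 + 0.5 + 0) / 3
mse = mean_squared_error(y_true_demo, y_pred_demo)    # (0.25 + 0.25 + 0) / 3
r2 = r2_score(y_true_demo, y_pred_demo)               # 1 - 0.5 / 6.5
```

R² compares the squared error of the model (0.5) with that of always predicting the mean of the targets (6.5).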


Clustering Metrics 


Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)

Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)

V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)
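A sketch of how the clustering metrics behave, on hypothetical label lists. They compare groupings rather than label values, so renaming the clusters does not change the score, while splitting every sample into its own cluster is penalised by the adjusted Rand index and the V-measure but not by homogeneity.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

labels_true = [0, 0, 1, 1]
swapped = [1, 1, 0, 0]       # same grouping, cluster names exchanged
singletons = [0, 1, 2, 3]    # every sample in its own cluster

ari_swapped = adjusted_rand_score(labels_true, swapped)
ari_singletons = adjusted_rand_score(labels_true, singletons)

# Each singleton cluster contains one class only, so homogeneity stays perfect.
hom_singletons = homogeneity_score(labels_true, singletons)
# V-measure also accounts for completeness, which is 0.5 here.
v_singletons = v_measure_score(labels_true, singletons)
```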


Prediction

Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))    # Predict labels
>>> y_pred = lr.predict(X_test)                      # Predict labels
>>> y_pred = knn.predict_proba(X_test)               # Estimate probability of a label

Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)                 # Predict labels in clustering algos


Preprocessing The Data 


Standardization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)

Encoding Categorical Features
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)


Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)

Imputing Missing Values
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit_transform(X_train)
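A small worked example of mean imputation. It assumes a recent scikit-learn, where the imputer is `SimpleImputer` in `sklearn.impute` and always works column by column; the matrix `X_missing` is hypothetical and uses 0 as the missing-value marker.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical matrix in which 0 marks a missing entry.
X_missing = np.array([[1.0, 0.0],
                      [3.0, 4.0],
                      [5.0, 8.0]])

imp = SimpleImputer(missing_values=0, strategy='mean')
X_filled = imp.fit_transform(X_missing)
# The 0 in the second column is replaced by the mean of the observed
# values in that column: (4 + 8) / 2 = 6.
```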


Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)

Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)
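A sketch of what the polynomial expansion produces, on a hypothetical single sample with features a=2 and b=3. The last line counts the output columns for degree 5 on five input features, which shows how quickly the expansion grows.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])    # hypothetical sample: a=2, b=3

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_demo)
# Columns are 1, a, b, a^2, a*b, b^2.

# Degree 5 on 5 input features: number of generated columns.
n_out_degree5 = PolynomialFeatures(5).fit(np.zeros((1, 5))).n_output_features_
```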


Cross-Validation

>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
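A self-contained sketch of cross-validation on the iris data. Wrapping the scaler and the classifier in a pipeline is an assumption added here, not part of the snippet above; it means the scaler is refitted on the training part of each fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scaling happens inside each fold, so the held-out part never influences it.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(model, X_iris, y_iris, cv=4)    # one accuracy per fold
mean_score, std_score = scores.mean(), scores.std()
```

Reporting the mean together with the spread across folds gives a more honest picture than a single split.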


Tune Your Model 
Grid Search 


>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
              "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
                        param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)
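A runnable sketch of a grid search on the iris data. The parameter ranges, `cv=4` and the held-out test split are illustrative assumptions; the search tries every combination in the grid.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)

params = {"n_neighbors": np.arange(1, 6),
          "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, cv=4)
grid.fit(X_tr, y_tr)    # cross-validates all 5 x 2 combinations on the training data

best = grid.best_params_                          # best combination found
n_candidates = len(grid.cv_results_["params"])    # number of combinations tried
test_accuracy = grid.score(X_te, y_te)            # refitted best model on held-out data
```

Keeping a test split outside the search gives an estimate that is not biased by the parameter selection.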


Randomized Parameter Optimization 


>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
              "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
                                 param_distributions=params,
                                 cv=4,
                                 n_iter=8,
                                 random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
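A runnable sketch of a randomized search on the iris data. Unlike a grid search, it evaluates only `n_iter` sampled combinations, and a parameter may be given as a distribution; the use of `scipy.stats.randint` and the range 1 to 9 are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

param_dist = {"n_neighbors": randint(1, 10),         # integers 1..9, sampled at random
              "weights": ["uniform", "distance"]}    # lists are sampled uniformly
rsearch = RandomizedSearchCV(estimator=KNeighborsClassifier(),
                             param_distributions=param_dist,
                             n_iter=5,
                             cv=4,
                             random_state=5)
rsearch.fit(X_iris, y_iris)

sampled = rsearch.cv_results_["params"]    # the 5 combinations that were tried
best_score = rsearch.best_score_
```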